Retrieval effectiveness study with Farsi language
Abstract
Having Farsi as the underlying language and using a test collection of 166,774 documents and 100 topics, this experiment evaluates the retrieval effectiveness of different IR models while using a light and a plural stemmer as well as n-gram and trunc-n indexing strategies. Moreover, the impact of stoplist removal is evaluated. According to the obtained results, the DFR-I(ne)C2 model is the best performing one. The proposed light and plural stemmers improve retrieval performance compared to the non-stemming approach. The trunc-4 and trunc-5 indexing strategies also have a positive impact on performance, while 3-grams and trunc-3 have the most negative impact on the results. The results reveal that, for Farsi, stoplist removal plays an important role in improving retrieval performance. A query-by-query analysis of the results shows that extreme results could be avoided by adding extra controls and rules, based on Farsi morphology, to the stemming algorithms.

RÉSUMÉ. Dans le but d'utiliser le persan comme langue de référence, et en utilisant une collection test de 166 774 documents et de 100 requêtes, cette étude évalue la performance des différents modèles de RI sur lesquels sont appliquées diverses stratégies d'indexation et de recherche. De plus, cette étude évalue l'impact de l'élimination de la liste des mots-outils lors de l'indexation. Selon les résultats obtenus, le modèle DFR-I(ne)C2 est le plus performant. L'enracineur léger et l'enracineur pluriel améliorent la performance en comparaison à l'approche sans enracineur. Les stratégies d'indexation, comme tronc-4 et tronc-5 améliorent la performance, alors que les approches comme 3-grams et tronc-3 ont l'impact le plus négatif sur les résultats. Les résultats révèlent que l'élimination de la liste des mots-outils joue un rôle important dans l'amélioration de la performance. L'analyse requêtes par requêtes montre qu'il serait possible d'ajouter des règles supplémentaires aux enracineurs, pour éviter des résultats erronés.
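The abstract names the trunc-n and n-gram indexing strategies without defining them. The following is a minimal sketch under the commonly used definitions, which are an assumption here and not confirmed by the abstract: trunc-n keeps the first n characters of each word, and n-gram indexing replaces each word with its overlapping character sequences of length n. The function names (`trunc_n`, `char_ngrams`, `index_terms`) are hypothetical, and tokenization is simplified to whitespace splitting.

```python
from typing import List


def trunc_n(word: str, n: int) -> str:
    """Keep only the first n characters of a word (assumed trunc-n definition)."""
    return word[:n]


def char_ngrams(word: str, n: int) -> List[str]:
    """Return the overlapping character n-grams of a word.

    Assumption: words no longer than n characters are kept unchanged.
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def index_terms(text: str, strategy: str, n: int) -> List[str]:
    """Turn a text into indexing terms using whitespace tokenization.

    strategy is either "trunc" or "ngram" (hypothetical labels).
    """
    tokens = text.split()
    if strategy == "trunc":
        return [trunc_n(token, n) for token in tokens]
    if strategy == "ngram":
        return [gram for token in tokens for gram in char_ngrams(token, n)]
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    print(index_terms("retrieval models", "trunc", 4))
    print(index_terms("index", "ngram", 3))
```

Both functions operate on Unicode code points, so they apply to Persian script as well as Latin script. The sketch ignores script-specific details such as the zero-width non-joiner, which a real Farsi indexer would likely need to normalize first.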
Similar resources
Using Text Surrounding Method to Enhance Retrieval of Online Images by Google Search Engine
Purpose: The current research aimed to compare the effectiveness of various tags and codes for retrieving images from Google. Design/methodology: Selected images with different characteristics in a registered domain were carefully studied. The exception was that special conceptual features have been apportioned for each group of images separately. In this regard, each group image surr...
Assessment of a Modern Farsi Corpus
The development of Language Engineering (LE) and Information Retrieval (IR) applications requires the availability of sizeable, reliable and representative corpora. This paper describes how we have constructed a well-structured 345 MB tagged corpus of news, and presents some useful statistics of this corpus based upon the characteristics of the Farsi language. It also goes into particular detail on...
Designing a Distributed search engine for Farsi/English web pages
In this paper we have tried to model, design and test a prototype Farsi/English search engine. The engine must cope with features of the web such as heterogeneity, volatility and a huge amount of unstructured worldwide information. These features, as well as the rapid advance in technology, challenge the effectiveness of classical Information Retrieval (IR) techniques. Although a g...
Ad Hoc Retrieval with the Persian Language
This paper describes our participation in the Persian ad hoc search during the CLEF 2009 evaluation campaign. In this task, we suggest using a light suffix-stripping algorithm for the Farsi (or Persian) language. The evaluations based on different probabilistic models demonstrated that our stemming approach performs better than a stemmer removing only the plural suffixes, or statistically bette...
UniNE at CLEF 2008: TEL, and Persian IR
In our participation in this evaluation campaign, our first objective was to analyze retrieval effectiveness when using The European Library (TEL) corpora composed of very short descriptions (library catalog records) and also to evaluate the retrieval effectiveness of several IR models. As a second objective we wanted to design and evaluate a stopword list and a light stemming strategy for the ...
Publication date: 2012